{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Effect of transforming the targets in a regression model\n\n\nThis example gives an overview of\n:class:`sklearn.compose.TransformedTargetRegressor`. Two examples\nillustrate the benefit of transforming the targets before learning a linear\nregression model. The first example uses synthetic data while the second\nexample is based on the Boston housing data set.\n\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Author: Guillaume Lemaitre <guillaume.lemaitre@inria.fr>\n# License: BSD 3 clause\n\n\nimport numpy as np\nimport matplotlib\nimport matplotlib.pyplot as plt\nfrom distutils.version import LooseVersion\n\nprint(__doc__)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Synthetic example\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.datasets import make_regression\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.linear_model import RidgeCV\nfrom sklearn.compose import TransformedTargetRegressor\nfrom sklearn.metrics import median_absolute_error, r2_score\n\n\n# `normed` is being deprecated in favor of `density` in histograms\nif LooseVersion(matplotlib.__version__) >= '2.1':\n    density_param = {'density': True}\nelse:\n    density_param = {'normed': True}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "A synthetic random regression problem is generated. The targets ``y`` are\nmodified by: (i) translating all targets such that all entries are\nnon-negative and (ii) applying an exponential function to obtain non-linear\ntargets which cannot be fitted well by a simple linear model.\n\nTherefore, a logarithmic function (`np.log1p`) is used to transform the\ntargets before training a linear regression model, and its inverse, an\nexponential function (`np.expm1`), maps the predictions back to the original\nscale.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "X, y = make_regression(n_samples=10000, noise=100, random_state=0)\ny = np.exp((y + abs(y.min())) / 200)\ny_trans = np.log1p(y)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The following plots illustrate the probability density functions of the\ntarget before and after applying the logarithmic function.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "f, (ax0, ax1) = plt.subplots(1, 2)\n\nax0.hist(y, bins=100, **density_param)\nax0.set_xlim([0, 2000])\nax0.set_ylabel('Probability')\nax0.set_xlabel('Target')\nax0.set_title('Target distribution')\n\nax1.hist(y_trans, bins=100, **density_param)\nax1.set_ylabel('Probability')\nax1.set_xlabel('Target')\nax1.set_title('Transformed target distribution')\n\nf.suptitle(\"Synthetic data\", y=0.035)\nf.tight_layout(rect=[0.05, 0.05, 0.95, 0.95])\n\nX_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "First, a linear model is fitted on the original targets. Because of the\nnon-linearity, this model predicts poorly. Then, a logarithmic function is\nused to linearize the targets, which allows better predictions with the same\ntype of linear model, as reported by the median absolute error (MAE).\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "f, (ax0, ax1) = plt.subplots(1, 2, sharey=True)\n\nregr = RidgeCV()\nregr.fit(X_train, y_train)\ny_pred = regr.predict(X_test)\n\nax0.scatter(y_test, y_pred)\nax0.plot([0, 2000], [0, 2000], '--k')\nax0.set_ylabel('Target predicted')\nax0.set_xlabel('True Target')\nax0.set_title('Ridge regression \\n without target transformation')\nax0.text(100, 1750, r'$R^2$=%.2f, MAE=%.2f' % (\n    r2_score(y_test, y_pred), median_absolute_error(y_test, y_pred)))\nax0.set_xlim([0, 2000])\nax0.set_ylim([0, 2000])\n\nregr_trans = TransformedTargetRegressor(regressor=RidgeCV(),\n                                        func=np.log1p,\n                                        inverse_func=np.expm1)\nregr_trans.fit(X_train, y_train)\ny_pred = regr_trans.predict(X_test)\n\nax1.scatter(y_test, y_pred)\nax1.plot([0, 2000], [0, 2000], '--k')\nax1.set_ylabel('Target predicted')\nax1.set_xlabel('True Target')\nax1.set_title('Ridge regression \\n with target transformation')\nax1.text(100, 1750, r'$R^2$=%.2f, MAE=%.2f' % (\n    r2_score(y_test, y_pred), median_absolute_error(y_test, y_pred)))\nax1.set_xlim([0, 2000])\nax1.set_ylim([0, 2000])\n\nf.suptitle(\"Synthetic data\", y=0.035)\nf.tight_layout(rect=[0.05, 0.05, 0.95, 0.95])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Real-world data set\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In a similar manner, the Boston housing data set is used to show the impact\nof transforming the targets before learning a model. In this example, the\ntargets to be predicted correspond to the weighted distances to the five\nBoston employment centers.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.datasets import load_boston\nfrom sklearn.preprocessing import QuantileTransformer, quantile_transform\n\ndataset = load_boston()\ntarget = np.array(dataset.feature_names) == \"DIS\"\nX = dataset.data[:, np.logical_not(target)]\ny = dataset.data[:, target].squeeze()\ny_trans = quantile_transform(dataset.data[:, target],\n                             n_quantiles=300,\n                             output_distribution='normal',\n                             copy=True).squeeze()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "A :class:`sklearn.preprocessing.QuantileTransformer` is used such that the\ntargets follow a normal distribution before applying a\n:class:`sklearn.linear_model.RidgeCV` model.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "f, (ax0, ax1) = plt.subplots(1, 2)\n\nax0.hist(y, bins=100, **density_param)\nax0.set_ylabel('Probability')\nax0.set_xlabel('Target')\nax0.set_title('Target distribution')\n\nax1.hist(y_trans, bins=100, **density_param)\nax1.set_ylabel('Probability')\nax1.set_xlabel('Target')\nax1.set_title('Transformed target distribution')\n\nf.suptitle(\"Boston housing data: distance to employment centers\", y=0.035)\nf.tight_layout(rect=[0.05, 0.05, 0.95, 0.95])\n\nX_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The effect of the transformer is weaker than on the synthetic data. However,\nthe transformation still decreases the MAE.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "f, (ax0, ax1) = plt.subplots(1, 2, sharey=True)\n\nregr = RidgeCV()\nregr.fit(X_train, y_train)\ny_pred = regr.predict(X_test)\n\nax0.scatter(y_test, y_pred)\nax0.plot([0, 10], [0, 10], '--k')\nax0.set_ylabel('Target predicted')\nax0.set_xlabel('True Target')\nax0.set_title('Ridge regression \\n without target transformation')\nax0.text(1, 9, r'$R^2$=%.2f, MAE=%.2f' % (\n    r2_score(y_test, y_pred), median_absolute_error(y_test, y_pred)))\nax0.set_xlim([0, 10])\nax0.set_ylim([0, 10])\n\nregr_trans = TransformedTargetRegressor(\n    regressor=RidgeCV(),\n    transformer=QuantileTransformer(n_quantiles=300,\n                                    output_distribution='normal'))\nregr_trans.fit(X_train, y_train)\ny_pred = regr_trans.predict(X_test)\n\nax1.scatter(y_test, y_pred)\nax1.plot([0, 10], [0, 10], '--k')\nax1.set_ylabel('Target predicted')\nax1.set_xlabel('True Target')\nax1.set_title('Ridge regression \\n with target transformation')\nax1.text(1, 9, r'$R^2$=%.2f, MAE=%.2f' % (\n    r2_score(y_test, y_pred), median_absolute_error(y_test, y_pred)))\nax1.set_xlim([0, 10])\nax1.set_ylim([0, 10])\n\nf.suptitle(\"Boston housing data: distance to employment centers\", y=0.035)\nf.tight_layout(rect=[0.05, 0.05, 0.95, 0.95])\n\nplt.show()"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.9"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}